## Requirements
- Access to the OpenAI Codex API: https://openai.com/blog/openai-codex/. Place the API key in a file named `openai_api_key`
- PyTorch: https://pytorch.org/
- Hugging Face transformers: https://pypi.org/project/transformers/
- tree-sitter-java: https://github.com/tree-sitter/tree-sitter-java
- tqdm
- tensorboard 

1. Script for data preprocessing
script_gen_and_preprocess_data.py 
This will produce an output file called commands_gen_and_preprocess. Running it will perform three things:
 - create the hole completion data by choosing the midpoint of each line as hole position [create_sample_data.py]
 - create a parse tree for each file as well as store repo-level meta-info needed to get prompt proposal context [parse_tree.py]
 - check for duplicates within a repo [check_duplication.py]
Running this will create a new folder called rule_classifier_data that has train, val and test subfolders. Inside each subfolder, there will be one folder per repository, containing the following:
 - The repository's .java files, with the directory structure preserved.
 - hole_data
 - file_class_data
 - parsed_data
 - duplicates
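The midpoint hole selection used for the hole completion data can be sketched as follows. This is a simplified illustration with hypothetical field names, not the actual logic of create_sample_data.py:

```python
def choose_holes(lines):
    """For each line, treat the midpoint character as the hole position:
    the text before it is kept as context and the remainder of the line
    becomes the target hole to be completed."""
    holes = []
    for line_num, line in enumerate(lines):
        mid = len(line) // 2
        holes.append({
            "line": line_num,          # 0-based line index
            "char": mid,               # hole position within the line
            "target_hole": line[mid:]  # ground-truth completion
        })
    return holes
```

For a 10-character line, the hole starts at character 5 and the last 5 characters form the target hole.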

2. Script for generating completions using Codex, i.e., obtaining the ground truth for training the prompt proposal classifier (the oracle).
script_completions.py
Generates a file commands_completion. Running this will create a new folder called results that has train, val and test subfolders. Inside each subfolder, there will be ten folders corresponding to prompt sources, each containing .json files corresponding to prompt context types. Each row of a file records the application of a particular prompt proposal to a hole: the target hole, the predicted hole, the prompt, and the validity of the prompt proposal.
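Consuming these per-proposal result files can be sketched as below; the row format and field names are illustrative assumptions, not the actual schema written by the script:

```python
import json

def read_results(path):
    """Read a results file where each row is one JSON record describing a
    prompt proposal applied to a hole (keys here are hypothetical)."""
    with open(path) as f:
        return [json.loads(line) for line in f]

def is_exact_match(row):
    # One common success criterion: the predicted hole equals the target
    # hole after stripping surrounding whitespace.
    return row["predicted_hole"].strip() == row["target_hole"].strip()
```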

3. Script for generating the oracle
script_analyze_results.py
Generates a file commands_analyze_results. Running this file will create a file called oracle inside each repo in rule_classifier_data.
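Conceptually, the oracle records, for each hole, which prompt proposals produced a correct completion. A minimal sketch of this aggregation, assuming per-proposal success booleans as input (the actual oracle file format may differ):

```python
def build_oracle(results_by_proposal):
    """results_by_proposal: {proposal_name: {hole_id: success_bool}}.
    Returns a map from each hole to the set of prompt proposals that
    produced a correct completion for it."""
    oracle = {}
    for proposal, per_hole in results_by_proposal.items():
        for hole_id, success in per_hole.items():
            oracle.setdefault(hole_id, set())
            if success:
                oracle[hole_id].add(proposal)
    return oracle
```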

4. Script for generating the prompt proposal context representations for PPC
generate_rule_representations.py
Example usage: python generate_rule_representations.py --data_split=val --repo=jata4test --emb_model_type=codebert
This will create a codebert_mod folder inside the path rule_classifier_data/val/jata4test. Each file in this folder contains the prompt proposal context representation obtained from CodeBERT for each hole.
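One common way to turn a transformer's token-level outputs into a single fixed-size vector per context is mean pooling over non-padding tokens. The sketch below shows only that pooling step; the actual representation computed by generate_rule_representations.py may differ:

```python
import torch

def pool_hidden_states(hidden_states, attention_mask):
    """Mean-pool last-layer hidden states (batch, seq, hidden) over the
    non-padding positions marked in attention_mask (batch, seq), yielding
    one (batch, hidden) representation per input."""
    mask = attention_mask.unsqueeze(-1).float()   # (batch, seq, 1)
    summed = (hidden_states * mask).sum(dim=1)    # sum over real tokens
    counts = mask.sum(dim=1).clamp(min=1.0)       # avoid division by zero
    return summed / counts
```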

5. Script for capping the number of holes
rearrange_data.py
This script will cap the maximum contribution from a repo at 10000 holes. After this, each repo folder will contain capped_holes_10000, the capped_codebert_mod folder and capped_oracle_10000.
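The capping idea can be sketched as seeded random subsampling; this is an illustration of the concept, not necessarily how rearrange_data.py selects the holes:

```python
import random

def cap_holes(hole_ids, cap=10000, seed=0):
    """Return at most `cap` holes from a repo. A fixed seed keeps the
    sample reproducible across runs."""
    if len(hole_ids) <= cap:
        return list(hole_ids)
    rng = random.Random(seed)
    return rng.sample(hole_ids, cap)
```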

6. Script for training PPC
rule_classifier_preprocessed_data.py
This needs the capped_codebert_mod folder (prompt proposal context representations) as well as the capped_oracle_10000 file to be present inside each repo folder.
The best model is stored in models directory along with the tensorboard logs. The output from each epoch is stored in the outputs folder.
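At a high level, the PPC can be viewed as a multi-label classifier that maps a prompt proposal context representation to a success score per prompt proposal, trained against the oracle labels. The architecture and sizes below are assumptions for illustration, not the model defined in rule_classifier_preprocessed_data.py:

```python
import torch
import torch.nn as nn

class PromptProposalClassifier(nn.Module):
    """Maps a context representation to one logit per prompt proposal;
    a natural training objective is nn.BCEWithLogitsLoss against the
    oracle's success labels. All sizes are illustrative placeholders."""
    def __init__(self, emb_dim=768, num_proposals=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(emb_dim, 256),
            nn.ReLU(),
            nn.Linear(256, num_proposals),
        )

    def forward(self, x):
        return self.net(x)
```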

7. Script for inference with PPC
rule_inference_preprocessed_data.py
This needs the capped_codebert_mod folder (prompt proposal context representations) as well as the capped_oracle_10000 file to be present inside each repo folder.
This produces a file inside the outputs folder that contains the prediction of the classifier for each hole.

8. Baselines
To generate completions for the baselines, use the following settings in generate_completions.py:
1. Random: `context_location = random_file`
2. Random NN: `context_location = random_file_NN`
3. Identifier Usage (Random): `context_location = identifier_usage_file_random`
4. Identifier Usage (NN): `context_location = identifier_usage_file_NN`
Then use analyze_separate_results.py with the corresponding context_location values to obtain the results for each baseline.

9. Script for getting the variation with k
get_info_from_predictions.py 
This needs a hole_stats_file as input (generated from the previous step) and a value of k.
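One way to read the variation with k is as a success@k metric: a hole counts as solved if any of the classifier's top-k ranked prompt proposals is one the oracle marks successful. A hypothetical sketch (the inputs and metric of get_info_from_predictions.py may differ):

```python
def success_at_k(ranked_proposals, oracle_successes, k):
    """True if any of the top-k ranked proposals is in the hole's set of
    oracle-successful proposals."""
    return any(p in oracle_successes for p in ranked_proposals[:k])

def fraction_solved(predictions, k):
    """predictions: list of (ranked_proposals, oracle_success_set), one
    entry per hole. Returns the fraction of holes solved at k."""
    solved = sum(success_at_k(r, o, k) for r, o in predictions)
    return solved / len(predictions)
```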


NOTE: For ease of reproduction, we have provided the rule_classifier_data folder, which contains the outputs from steps 1-3 above and the first two output files of step 5. Note that you will still need to generate the prompt proposal context representations using step 4 and cap them using step 5.


